Training Datasets for Machine Reading Comprehension and Their Limitations
Neural networks are a powerful model class to learn machine Reading Comprehension (RC), yet they crucially depend on the availability of suitable training datasets. In this thesis we describe methods for data collection, evaluate the performance of established models, and examine a number of model behaviours and dataset limitations. We first describe the creation of a data resource for the science exam QA domain, and compare existing models on the resulting dataset. The collected questions are plausible – non-experts can distinguish them from real exam questions with 55% accuracy – and using them as additional training data leads to improved model scores on real science exam questions. Second, we describe and apply a distant supervision dataset construction method for multi-hop RC across documents. We identify and mitigate several dataset assembly pitfalls – a lack of unanswerable candidates, label imbalance, and spurious correlations between documents and particular candidates – which often leave shallow predictive cues for the answer. Furthermore we demonstrate that selecting relevant document combinations is a critical performance bottleneck on the datasets created. We thus investigate Pseudo-Relevance Feedback, which leads to improvements compared to TF-IDF-based document combination selection both in retrieval metrics and answer accuracy. Third, we investigate model undersensitivity: model predictions do not change when given adversarially altered questions in SQuAD2.0 and NewsQA, even though they should. We characterise affected samples, and show that the phenomenon is related to a lack of structurally similar but unanswerable samples during training: data augmentation reduces the adversarial error rate, e.g. from 51.7% to 20.7% for a BERT model on SQuAD2.0, and improves robustness also in other settings.
Finally, we explore efficient formal model verification via Interval Bound Propagation (IBP) to measure and address model undersensitivity, and show that using an IBP-derived auxiliary loss can improve verification rates, e.g. from 2.8% to 18.4% on the SNLI test set.
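To make the idea behind Interval Bound Propagation concrete, the following is a minimal sketch of how an elementwise input interval can be pushed through one affine layer followed by a ReLU. It is an illustration only, not the thesis's implementation: the function names, the weights, and the perturbation radius `eps` are all hypothetical.

```python
import numpy as np

def affine_bounds(lower, upper, W, b):
    """Propagate the box [lower, upper] through x -> W @ x + b.

    The box is written as center +/- radius; the affine map moves the
    center exactly, while |W| gives the worst-case growth of the radius.
    """
    center = (upper + lower) / 2.0
    radius = (upper - lower) / 2.0
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius
    return new_center - new_radius, new_center + new_radius

def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

# Hypothetical toy layer and input (assumptions for illustration only).
W = np.array([[1.0, -2.0],
              [0.5, 0.5]])
b = np.array([0.0, -1.0])
x = np.array([1.0, 1.0])
eps = 0.1  # radius of the perturbation box around x

pre_lo, pre_hi = affine_bounds(x - eps, x + eps, W, b)
out_lo, out_hi = relu_bounds(pre_lo, pre_hi)

# Empirical soundness check: outputs of sampled points must stay in the bounds.
rng = np.random.default_rng(0)
samples = x + rng.uniform(-eps, eps, size=(1000, 2))
outputs = np.maximum(samples @ W.T + b, 0.0)
sound = bool(np.all(outputs >= out_lo - 1e-9) and np.all(outputs <= out_hi + 1e-9))
```

The resulting bounds hold for every input in the box, which is what makes them usable for verification. How an auxiliary training loss is built from such bounds is a design choice this sketch does not cover.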
A Factorization Machine Framework for Testing Bigram Embeddings in Knowledgebase Completion
Embedding-based Knowledge Base Completion models have so far mostly combined
distributed representations of individual entities or relations to compute
truth scores of missing links. Facts can however also be represented using
pairwise embeddings, i.e. embeddings for pairs of entities and relations. In
this paper we explore such bigram embeddings with a flexible Factorization
Machine model and several ablations from it. We investigate the relevance of
various bigram types on the fb15k237 dataset and find relative improvements
compared to a compositional model.
Comment: accepted for AKBC 2016 workshop, 6 pages
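As a rough illustration of the kind of model the abstract above refers to, the sketch below scores a feature vector with a generic second-order Factorization Machine. The feature layout (indicator features for a subject, a relation, an object, and one subject-relation bigram) is a hypothetical assumption, as are all names and dimensions; the abstract does not specify the paper's exact feature set or ablations.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Second-order Factorization Machine score.

    y = w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j,
    with the pairwise term computed in O(n * k) via the identity
    0.5 * sum_f [(sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2].
    """
    linear = w0 + w @ x
    sum_then_square = (V.T @ x) ** 2
    square_then_sum = (V ** 2).T @ (x ** 2)
    return linear + 0.5 * np.sum(sum_then_square - square_then_sum)

def fm_score_naive(x, w0, w, V):
    """Reference implementation with an explicit loop over feature pairs."""
    total = w0 + w @ x
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            total += (V[i] @ V[j]) * x[i] * x[j]
    return total

# Hypothetical feature indices for one candidate fact (s, r, o):
# 0 = subject, 1 = relation, 2 = object, 3 = (subject, relation) bigram;
# indices 4 and 5 stand for features that are inactive for this fact.
n_features, k = 6, 4
rng = np.random.default_rng(0)
w0 = 0.1
w = rng.normal(size=n_features)
V = rng.normal(size=(n_features, k))

x = np.zeros(n_features)
x[[0, 1, 2, 3]] = 1.0

score = fm_score(x, w0, w, V)
reference = fm_score_naive(x, w0, w, V)
```

With indicator features, each pairwise term is simply the dot product of two embeddings, so adding or removing a bigram feature changes which interactions contribute to the score.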
Crowdsourcing Multiple Choice Science Questions
We present a novel method for obtaining high-quality, domain-targeted
multiple choice questions from crowd workers. Generating these questions can be
difficult without trading away originality, relevance or diversity in the
answer options. Our method addresses these problems by leveraging a large
corpus of domain-specific text and a small set of existing questions. It
produces model suggestions for document selection and answer distractor choice
which aid the human question generation process. With this method we have
assembled SciQ, a dataset of 13.7K multiple choice science exam questions
(Dataset available at http://allenai.org/data.html). We demonstrate that the
method produces in-domain questions by providing an analysis of this new
dataset and by showing that humans cannot distinguish the crowdsourced
questions from original questions. When using SciQ as additional training data
to existing questions, we observe accuracy improvements on real science exams.
Comment: accepted for the Workshop on Noisy User-generated Text (W-NUT) 201
Constructing Datasets for Multi-hop Reading Comprehension Across Documents
Most Reading Comprehension methods limit themselves to queries which can be
answered using a single sentence, paragraph, or document. Enabling models to
combine disjoint pieces of textual evidence would extend the scope of machine
comprehension methods, but currently there exist no resources to train and test
this capability. We propose a novel task to encourage the development of models
for text understanding across multiple documents and to investigate the limits
of existing methods. In our task, a model learns to seek and combine evidence -
effectively performing multi-hop (alias multi-step) inference. We devise a
methodology to produce datasets for this task, given a collection of
query-answer pairs and thematically linked documents. Two datasets from
different domains are induced, and we identify potential pitfalls and devise
circumvention strategies. We evaluate two previously proposed competitive
models and find that one can integrate information across documents. However,
both models struggle to select relevant information, as providing documents
guaranteed to be relevant greatly improves their performance. While the models
outperform several strong baselines, their best accuracy reaches 42.9% compared
to human performance at 74.0% - leaving ample room for improvement.
Comment: This paper directly corresponds to the TACL version
(https://transacl.org/ojs/index.php/tacl/article/view/1325) apart from minor
changes in wording, additional footnotes, and appendices
Complex Embeddings for Simple Link Prediction
In statistical relational learning, the link prediction problem is key to
automatically understanding the structure of large knowledge bases. As in previous
studies, we propose to solve this problem through latent factorization.
However, here we make use of complex valued embeddings. The composition of
complex embeddings can handle a large variety of binary relations, among them
symmetric and antisymmetric relations. Compared to state-of-the-art models such
as Neural Tensor Network and Holographic Embeddings, our approach based on
complex embeddings is arguably simpler, as it only uses the Hermitian dot
product, the complex counterpart of the standard dot product between real
vectors. Our approach is scalable to large datasets as it remains linear in
both space and time, while consistently outperforming alternative approaches on
standard link prediction benchmarks.
Comment: 10+2 pages, accepted at ICML 201
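The claim above that complex embeddings can handle both symmetric and antisymmetric relations can be checked with a small sketch. It scores a triple as the real part of a trilinear product in which the object embedding is conjugated, i.e. a Hermitian-style product. The embedding size and the random values are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
import numpy as np

def complex_score(e_s, w_r, e_o):
    """Score a (subject, relation, object) triple.

    Takes the real part of sum_k w_r[k] * e_s[k] * conj(e_o[k]).
    Conjugating the object embedding makes the score depend on argument order.
    """
    return float(np.real(np.sum(w_r * e_s * np.conj(e_o))))

# Hypothetical 4-dimensional complex embeddings for two entities.
rng = np.random.default_rng(0)
dim = 4
e_a = rng.normal(size=dim) + 1j * rng.normal(size=dim)
e_b = rng.normal(size=dim) + 1j * rng.normal(size=dim)

# A purely real relation vector yields a symmetric score.
w_sym = rng.normal(size=dim) + 0j
sym_forward = complex_score(e_a, w_sym, e_b)
sym_backward = complex_score(e_b, w_sym, e_a)

# A purely imaginary relation vector yields an antisymmetric score.
w_anti = 1j * rng.normal(size=dim)
anti_forward = complex_score(e_a, w_anti, e_b)
anti_backward = complex_score(e_b, w_anti, e_a)
```

Swapping subject and object conjugates the product of the two entity embeddings. A real relation vector leaves the real part unchanged, while a purely imaginary one flips its sign. A general complex relation vector mixes the two behaviours.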
Making sense of sensory input
This paper attempts to answer a central question in unsupervised learning:
what does it mean to "make sense" of a sensory sequence? In our formalization,
making sense involves constructing a symbolic causal theory that both explains
the sensory sequence and also satisfies a set of unity conditions. The unity
conditions insist that the constituents of the causal theory -- objects,
properties, and laws -- must be integrated into a coherent whole. On our
account, making sense of sensory input is a type of program synthesis, but it
is unsupervised program synthesis.
Our second contribution is a computer implementation, the Apperception
Engine, that was designed to satisfy the above requirements. Our system is able
to produce interpretable human-readable causal theories from very small amounts
of data, because of the strong inductive bias provided by the unity conditions.
A causal theory produced by our system is able to predict future sensor
readings, as well as retrodict earlier readings, and impute (fill in the blanks
of) missing sensory readings, in any combination.
We tested the engine in a diverse variety of domains, including cellular
automata, rhythms and simple nursery tunes, multi-modal binding problems,
occlusion tasks, and sequence induction intelligence tests. In each domain, we
test our engine's ability to predict future sensor values, retrodict earlier
sensor values, and impute missing sensory data. The engine performs well in all
these domains, significantly out-performing neural net baselines. We note in
particular that in the sequence induction intelligence tests, our system
achieved human-level performance. This is notable because our system is not a
bespoke system designed specifically to solve intelligence tests, but a
general-purpose system that was designed to make sense of any sensory sequence.